Linear Model¶
See the backing repository for Linear Model here.
Summary¶
Linear / logistic regression, where the relationship between the response and its explanatory variables is modeled with linear predictor functions. This is one of the foundational models in statistical modeling: it trains quickly and offers good interpretability, but its predictive performance varies by problem. The implementation is a light wrapper around the linear / logistic regression estimators exposed in scikit-learn.
How it Works¶
Christoph Molnar’s “Interpretable Machine Learning” e-book [1] has an excellent overview of linear and logistic regression models that can be found here and here respectively.
For implementation-specific details, scikit-learn’s user guide [2] on linear and logistic regression models is solid and can be found here.
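In brief (a sketch of the standard formulation, not taken from the references above): linear regression models the response directly as a weighted sum of the features, while logistic regression passes that same linear predictor through the logistic (sigmoid) function to yield a class probability.

\[
\hat{y} = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p,
\qquad
\hat{p}(y = 1 \mid x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p)}}
\]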
Code Example¶
The following code will train a logistic regression for the breast cancer dataset. The visualizations provided will be for both global and local explanations.
from interpret import set_visualize_provider
from interpret.provider import InlineProvider
set_visualize_provider(InlineProvider())

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

from interpret.glassbox import LogisticRegression
from interpret import show

seed = 1
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=seed)

lr = LogisticRegression(random_state=seed)
lr.fit(X_train, y_train)

lr_global = lr.explain_global()  # global explanation: per-feature coefficients
show(lr_global)

lr_local = lr.explain_local(X_test[:5], y_test[:5])  # local explanations for five test instances
show(lr_local)
Further Resources¶
Bibliography¶
- [1] Christoph Molnar. Interpretable Machine Learning. Lulu.com, 2020.
- [2] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, and others. Scikit-learn: machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
API¶
LinearRegression¶
class interpret.glassbox.LinearRegression(feature_names=None, feature_types=None, linear_class=<class 'sklearn.linear_model._coordinate_descent.Lasso'>, **kwargs)¶
Initializes class.
- Parameters
feature_names – List of feature names.
feature_types – List of feature types.
linear_class – A scikit-learn linear class.
**kwargs – Kwargs passed to the linear class at initialization time.
explain_global(name=None)¶
Provides global explanation for model.
- Parameters
name – User-defined explanation name.
- Returns
An explanation object, visualizing feature-value pairs as a horizontal bar chart.
explain_local(X, y=None, name=None)¶
Provides local explanations for provided instances.
- Parameters
X – Numpy array for X to explain.
y – Numpy vector for y to explain.
name – User-defined explanation name.
- Returns
An explanation object, visualizing feature-value pairs for each instance as horizontal bar charts.
fit(X, y)¶
Fits model to provided instances.
- Parameters
X – Numpy array for training instances.
y – Numpy array as training labels.
- Returns
Itself.
predict(X)¶
Predicts on provided instances.
- Parameters
X – Numpy array for instances.
- Returns
Predicted value per instance.
score(X, y, sample_weight=None)¶
Return the coefficient of determination \(R^2\) of the prediction.
The coefficient \(R^2\) is defined as \((1 - \frac{u}{v})\), where \(u\) is the residual sum of squares ((y_true - y_pred) ** 2).sum() and \(v\) is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get a \(R^2\) score of 0.0.
- Parameters
X (array-like of shape (n_samples, n_features)) – Test samples. For some estimators this may be a precomputed kernel matrix or a list of generic objects instead with shape (n_samples, n_samples_fitted), where n_samples_fitted is the number of samples used in the fitting for the estimator.
y (array-like of shape (n_samples,) or (n_samples, n_outputs)) – True values for X.
sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.
- Returns
score – \(R^2\) of self.predict(X) wrt. y.
- Return type
float
Notes
The \(R^2\) score used when calling score on a regressor uses multioutput='uniform_average' from version 0.23 to keep consistent with default value of r2_score(). This influences the score method of all the multioutput regressors (except for MultiOutputRegressor).
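The definition above can be checked by hand. A quick numeric sketch (the toy values are arbitrary, chosen only to make the arithmetic visible):

```python
import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

u = ((y_true - y_pred) ** 2).sum()        # residual sum of squares
v = ((y_true - y_true.mean()) ** 2).sum() # total sum of squares
r2 = 1 - u / v

print(r2)  # ≈ 0.9486
```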
LogisticRegression¶
class interpret.glassbox.LogisticRegression(feature_names=None, feature_types=None, linear_class=<class 'sklearn.linear_model._logistic.LogisticRegression'>, **kwargs)¶
Initializes class.
- Parameters
feature_names – List of feature names.
feature_types – List of feature types.
linear_class – A scikit-learn linear class.
**kwargs – Kwargs passed to the linear class at initialization time.
explain_global(name=None)¶
Provides global explanation for model.
- Parameters
name – User-defined explanation name.
- Returns
An explanation object, visualizing feature-value pairs as a horizontal bar chart.
explain_local(X, y=None, name=None)¶
Provides local explanations for provided instances.
- Parameters
X – Numpy array for X to explain.
y – Numpy vector for y to explain.
name – User-defined explanation name.
- Returns
An explanation object, visualizing feature-value pairs for each instance as horizontal bar charts.
fit(X, y)¶
Fits model to provided instances.
- Parameters
X – Numpy array for training instances.
y – Numpy array as training labels.
- Returns
Itself.
predict(X)¶
Predicts on provided instances.
- Parameters
X – Numpy array for instances.
- Returns
Predicted class label per instance.
predict_proba(X)¶
Probability estimates on provided instances.
- Parameters
X – Numpy array for instances.
- Returns
Probability estimate of instance for each class.
score(X, y, sample_weight=None)¶
Return the mean accuracy on the given test data and labels.
In multi-label classification, this is the subset accuracy which is a harsh metric since you require for each sample that each label set be correctly predicted.
- Parameters
X (array-like of shape (n_samples, n_features)) – Test samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs)) – True labels for X.
sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.
- Returns
score – Mean accuracy of self.predict(X) wrt. y.
- Return type
float